Experience-driven formation of parts-based representations in a model of layered visual memory



Jenia Jitsev 1,2,* and Christoph von der Malsburg 1

1 Frankfurt Institute for Advanced Studies, Frankfurt am Main, Germany
2 Johann Wolfgang Goethe University, Frankfurt Am Main, Germany 



Running Title: Formation of layered visual memory.

Correspondence:
* Jenia Jitsev
Frankfurt Institute for Advanced Studies (FIAS),
Ruth-Moufang-Str. 1, 60438 Frankfurt am Main, Germany
jitsev@fias.uni-frankfurt.de



Abstract 



Growing neuropsychological and neurophysiological evidence suggests that the visual cortex uses parts- 
based representations to encode, store and retrieve relevant objects. In such a scheme, objects are repre- 
sented as a set of spatially distributed local features, or parts, arranged in stereotypical fashion. To encode 
the local appearance and to represent the relations between the constituent parts, there has to be an
appropriate memory structure formed by previous experience with visual objects. Here, we propose a model of
how a hierarchical memory structure supporting efficient storage and rapid recall of parts-based
representations can be established by an experience-driven process of self-organization. The process is based on
the collaboration of slow bidirectional synaptic plasticity and homeostatic unit activity regulation, both
running on top of fast activity dynamics with winner-take-all character modulated by an oscillatory
rhythm. These neural mechanisms lay down the basis for cooperation and competition between the
distributed units and their synaptic connections. Choosing human face recognition as a test task, we show
that, under the condition of open-ended, unsupervised incremental learning, the system is able to form 
memory traces for individual faces in a parts-based fashion. On a lower memory layer the synaptic struc- 
ture is developed to represent local facial features and their interrelations, while the identities of different 
persons are captured explicitly on a higher layer. An additional property of the resulting representations 
is the sparseness of both the activity during the recall and the synaptic patterns comprising the memory 
traces.

1. Introduction



A working hypothesis of cognitive neuroscience states that the higher functions of the brain require the
coordinated interplay of multiple cortical areas distributed over the brain-wide network. For instance, the
mechanisms of memory are thought to be subserved by various cortical and subcortical regions, including
the medial temporal lobe (MTL), inferior temporal (IT) and prefrontal (PFC) cortex areas (Fuster, 1997;
Miyashita, 2004), to name only a few of those prominent in the function of visual memory. Studies of
information processing in the course of encoding, consolidation and retrieval of visual representations
reveal a hierarchical organization, sparse distributed activity and massive recurrent communication
within the memory structure (Tsao et al., 2006; Konen and Kastner, 2008; Osada et al., 2008). Here we
focus our attention on developmental issues and discuss the process of self-organization that may lead
to the formation of the core structure responsible for flexible, rapid and efficient memory function, with
organizational properties as inferred from the experimental works.

It is widely held that processes responsible for memory formation rely on activity-dependent modification
of the synaptic transmission and on regulation of the intrinsic properties of single neurons (Miyashita,
1988; Bear, 1996; Zhang and Linden, 2003). However, it is far from clear how these local processes could
be orchestrated for memorizing complex visual objects composed of many spatially distributed subparts
arranged in stereotypic relations. In mature cortex, there is strong evidence for a basic vocabulary of
shape primitives and elementary object parts in the TEO and TE areas of posterior and anterior IT (Fujita
et al., 1992; Tanaka, 2003) as well as for identity and category specific neurons in anterior IT, PFC and
hippocampus (Freedman et al., 2003; Quiroga et al., 2005). Further findings indicate that the encoding of
visual objects involves the formation of sparse clusters of distributed activity across the processing
hierarchy within inferior temporal cortex (Tsunoda et al., 2001; Reddy and Kanwisher, 2006). This seems to
be a neuronal basis for the parts-based representation that the visual system employs to construct objects
from their constituent part elements (Ullman et al., 2002; Hayworth and Biederman, 2006).

In the light of these findings, we may ask ourselves whether the observed memory organization happens
to be the outcome of a self-organization process that would have to find a solution to a number of
developmental tasks. To provide a neural substrate for the parts-based representation, memory traces have 
to be formed and maintained in an unsupervised fashion to span the basic vocabulary for the visual ele- 
ments and to define associative links between them. Subsets of associatively linked complex features can 
then be interpreted as coherent objects composed of the respective parts. As there is a virtually unlimited 
number of visual objects in the environment, the limited resources spent on formation of these memory 
traces have to be carefully allocated to avoid unfavorable interference effects and information loss caused 
by potential memory content overlap. Thus, the system is permanently confronted with the problem of 
selecting the right small population out of the totally available, potentially conflicting synaptic facilities 
which has to be modified for acquisition and consolidation of a novel stimulus. Moreover, if objects stored 
in memory are supposed to share common parts, a regulation mechanism would be required to balance 
the usage load of part-specific units and minimize the interference, ensuring their optimal participation
in memory content formation and encoding. Another issue is the timing of the modifications, which have 
to be coordinated properly if the correct relational structure of distributed parts constituting the object's 
identity is to be stored in the memory. 

The same selection problem arises on the fast time scale, during memory recall or for encoding of
a novel object. Currently, there is broad agreement on the sparseness of the activity patterns evoked
by the presentation of a complex visual object, where only a small fraction of the available neurons in
the higher visual cortex participate in the stimulus-related response (Rolls and Tovee, 1995; Olshausen
and Field, 2004; Quiroga et al., 2008). In the context of the parts-based representation scheme, one
possible interpretation of sparse activation would be the selection of a few parts from a large overcomplete
vocabulary for the composition of the global visual object. Considering the speed of object recognition
measured in psychophysical experiments on humans and primates (Thorpe and Fabre-Thorpe, 2001), there
have to be neural mechanisms allowing this selection procedure to happen within the very short time of
a few hundred milliseconds. Moreover, if relations are to be represented by dynamic assemblies of
co-activated part-specific neurons, such a combinatorial selection would require clear unambiguous temporal
correlations between the constituent neurons to identify them and only them as being part of the same
assembly encoding the object (von der Malsburg, 1999; Singer, 1999).

Hypothesizing that the process of neural resource selection and its coordination across distributed
units is a crucial ingredient for successful structure formation and learning, we address in this study the
neural mechanisms behind the selection process by incorporating them in a model of a layered visual
memory. Here we take the competition and cooperation between the neuronal units as the functional basis
for the structure formation (von der Malsburg and Singer, 1988; Edelman, 1993) and provide modification
mechanisms based on activity-dependent bidirectional plasticity (Bienenstock et al., 1982; Artola and
Singer, 1993) and homeostatic activity regulation (Desai et al., 1999). We confront the system with a
task of unsupervised learning and human face recognition using a database of natural face images. Our
aim is then to demonstrate the formation of synaptic memory structure comprising bottom-up, lateral and
top-down connectivity.

Starting from an initial undifferentiated connectivity state, the system is able to form a representational 
basis for the storage of individual faces in a parts-based fashion by developing memory traces for each 
individual person over repetitive presentations of the face images. The memory traces are residing in the 
scaffold of lateral and top-down connectivity making up the content of the associative memory that holds 
the associatively linked local features on the lower and the configurational global identity on the higher 
memory layer. The recognition of face identity can then be explicitly signaled by the units on the higher 
memory layer (Fig. 1). By performing this self-organization, the system solves a highly non-trivial and
important problem of capturing local and global signal structure simultaneously in an unsupervised,
open-ended fashion, learning not only the appearance of local parts, but also memorizing their combinations to
represent the global stimulus identity explicitly in lateral and top-down connectivity. None of the previous
works on unsupervised learning of natural object representation was able to solve this problem in this
explicit form (Waydo and Koch, 2008; Wallis et al., 2008).

As a consequence of this explicit representation, the local facial features are interpreted in the global 
context of the identity of a person, making use of the structure formed in the course of previous experi- 
ence. This contextual structure can also be utilized in generative fashion to replay the memory content 
in absence of external stimuli, also supporting the mechanism of selective object-based attention. The 
binding of the local features and their identity label into a coherent assembly is done in the course of a 
decision cycle spanned by a common oscillatory rhythm. The rhythm modulates the competition strength 
and builds up a frame for repetitive local winner-take-all computation. As the agreement between incom- 
ing bottom-up, lateral and top-down signals gets continuously improved during the competitive learning, 
the bound assemblies tend to reflect more and more consistently the face identities stored in the memory, 
so that the recognition error progressively decreases. Moreover, the employment of the contextual con- 
nectivity speeds up the learning progress and leads to a greater capability to generalize over novel data 
not shown before. The advanced view on the structure formation as an optimization process driven by 
evolutionary mechanisms of selection and amplification may also serve as a conceptual basis for studying 
self-organization of generic subsystem coordination, independent of the nature of the cognitive task.

2. Materials and Methods



2.1. Visual memory network organization 

Our model is based on two consecutive interconnected layers (Fig. 1), which we tend to identify with
the hierarchically organized regions of IT and PFC, containing a number of segregated cortical modules
that will be termed columns (Fujita et al., 1992; Mountcastle, 1997; Tanaka, 2003). The columns situated
on the lower layer will be termed here bunch columns, as each of them is supposed to hold a set of
local facial features acquired in the course of learning. The column on the higher memory layer will
be called identity column, as its task will be to learn the global face identity for each individual person
composed out of distributed local features on the lower memory layer. Being a local processing module,
each column further contains a number of subunits we call core units (or simply units), which receive
common excitatory afferents and are bound by common lateral inhibition. Acting as elementary processing
units of the network, the core units represent an analogy to a tightly coupled population of excitatory
pyramidal neurons ("pyramidal core") as documented in cortical layers II/III and V (Peters et al., 1997;
Rockland and Ichinohe, 2004; Yoshimura et al., 2005). These populations are thought to be capable of
sustaining their own activity even if afferent drive is removed.

On the lower level of processing, each bunch column is attached to a dedicated landmark on the face to 
process the sensory signal represented by a Gabor filter bank extracted locally from the image (Daugman,
1985; Wiskott et al., 1997). The connections bunch units receive from the image constitute their bottom-up
receptive fields (here, referring to a receptive field we always mean the pattern of synaptic connections
converging on a unit). Furthermore, there are excitatory lateral connections between the bunch columns on 
the lower layer binding the core units across the modules. The bunch units also send bottom-up efferents to 
and get top-down afferent projections from the identity units situated on the higher level of processing. All 
the types of intercolumnar synapses are excitatory and plastic, the connectivity structure being all-to-all 
homogeneous in the initial state. 



2.2. Dynamics of a core unit 



A cortical column module containing a set of n core units is modeled by a set of n differential equations,
each describing the dynamic behavior of the unit's activity variable p. The basic form of the equation,
ignoring the afferent inputs for the time being, is motivated by a previous computational study on a cortical
column (Lücke, 2005):

$$\tau \frac{dp}{dt} = \alpha p^2 (1 - p) - \beta p^3 - \lambda \nu \left( \max_t(p_t) - p \right) p, \tag{1}$$



where $\tau$ is the time constant, $\alpha$ the strength of the self-excitability, $\beta$ the strength of self-inhibitory effects,
$\lambda$ the strength of the lateral inhibition between the units, $\nu$ the inhibitory oscillation signal and $\max_t(p_t)$
the activity of the strongest unit in the column module. In this study we set for all units $\tau = 0.02$ ms,
$\alpha = \beta = 1$, $\lambda = 2$. As p reflects the activity of a whole neuronal population receiving common afferents,
we may assume a small time constant value, referring to an almost instantaneous response behavior of a
sufficiently large (n = 100 or more) population of neurons (Gerstner, 2000).

A crucial property of the column dynamics is the ability to change the structure of the stable activity 
states by variation of the parameter $\nu$. We take the oscillatory inhibition activity $\nu$ (Fig. 2) to be of a form

$$\nu(t) = \dots \tag{2}$$

Figure 1: Layered visual memory model. (A) Two consecutive interconnected layers for hierarchical
processing. On the lower bunch layer (IT, each column contains n = 20 units), a storehouse of local parts
linked associatively via lateral connections is formed by unsupervised learning. On the higher identity
layer (PFC, column contains m = 40 units), symbols for person identities emerge, being semantically
rooted in parts-based representations of the lower layer. The identity units provide further contextual
support for the lower layer by establishing top-down projections to the corresponding part-specific units.
(B) Different face views used as input to the memory (one person out of a total of 40 used for learning is shown).
Top left is the original view with neutral expression used for learning. Other views were used for testing
the generalization performance (bottom row shows the duplicate views taken two weeks after the original
series). (C) Facial landmarks used for the sensory input to the memory, provided by Gabor filter banks
extracted at each landmark point.







Figure 2: Excitatory ($\omega$) and inhibitory ($\nu$) oscillation rhythms defining a decision cycle in the gamma
range.



with its period T = 25 ms being in the gamma range. $\nu_{min}$ and $\nu_{max}$ are the lower and upper bounds for
the oscillation amplitude; $T_{init}$, k and g parameterize the form of the sigmoid activity curve. Here the values are
set to $\nu_{min} = 0.005$, $\nu_{max} = 1.0$, $T_{init} = 5$ ms, g = 0.5, k = 2. With the rising strength of the oscillatory
inhibition, the parameter $\nu$ crosses a critical bifurcation point of structural instability, which is given by

$$\nu_c = \frac{\alpha}{\lambda}, \tag{3}$$

so that by inserting the given values of $\alpha$ and $\lambda$ we obtain $\nu_c = 0.5$. For the range $\nu < \nu_c$, any subset
of units can remain active (with the stationary activity level $p = \frac{\alpha}{\alpha + \beta}$), as these states are stable given the
low strength of lateral inhibition. After crossing the critical value $\nu_c$, all states having more than one
unit active lose their stability, so that only a single winner unit can remain active at the level $\frac{\alpha}{\alpha + \beta}$. The
bifurcation property realizes winner-take-all behavior of the column acting as a competitive decision unit
(Lücke, 2005) to select the best response alternative on the basis of the incoming input.
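The bifurcation can be checked numerically. The following minimal Python sketch integrates Eq. 1 with a simple Euler scheme for a constant inhibition level $\nu$, using $\tau = 0.02$ ms, $\alpha = \beta = 1$ and $\lambda = 2$ from the text. The function names, the step size of 0.002 ms, the number of steps and the initial activities are illustrative assumptions and not part of the original model description.

```python
def critical_nu(alpha=1.0, lam=2.0):
    """Critical inhibition level nu_c = alpha / lambda."""
    return alpha / lam


def simulate_column(p0, nu, steps=2000, dt=0.002, tau=0.02,
                    alpha=1.0, beta=1.0, lam=2.0):
    """Euler integration of Eq. 1 for one column at a fixed inhibition level nu.

    p0 holds the initial activities of the core units (hypothetical values).
    """
    p = list(p0)
    for _ in range(steps):
        p_max = max(p)  # activity of the currently strongest unit
        dp = [alpha * x * x * (1.0 - x) - beta * x ** 3
              - lam * nu * (p_max - x) * x for x in p]
        p = [x + (dt / tau) * d for x, d in zip(p, dp)]
    return p


# Below nu_c several units stay active near alpha / (alpha + beta) = 0.5;
# above nu_c only the initially strongest unit survives.
below = simulate_column([0.5, 0.45, 0.4], nu=0.2)
above = simulate_column([0.5, 0.45, 0.4], nu=0.8)
```

In this sketch the inhibition is held constant for clarity; in the model it rises within each decision cycle, so the column passes from the multi-unit regime into the single-winner regime once per cycle.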



The qualitative dynamical behavior stays the same in the extended formulation of the activity equation, 
which is: 

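$$\tau \frac{dp}{dt} = \alpha \omega \left( 1 + \kappa^{LAT} I^{LAT} + \kappa^{TD} I^{TD} \right) p^2 (1 - p) - \beta p^3 - \lambda \omega \nu \left( \max_t(p_t) - p \right) p + \kappa^{BU} I^{BU} p^2 + \theta p + \omega \varepsilon + \sigma \eta_t p, \tag{4}$$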

where $I^{BU}$, $I^{LAT}$, $I^{TD}$ are the afferent inputs of respective bottom-up, lateral and top-down origin,
$\kappa^{BU} = \kappa^{LAT} = \kappa^{TD}$ their coupling coefficients, $\omega$ is an excitatory oscillatory signal, $\theta$ an excitability
threshold of the unit, $\sigma = 0.001$ parameterizes the multiplicative Gaussian white noise $\eta_t$ and $\varepsilon$ is an
unspecific excitatory drive. $\theta$ is a dynamic threshold variable used for homeostatic activity regulation of
the unit; it will be described later in detail. $\varepsilon$ depends on the total number of core units n.

An important modeling assumption is the separation of the synapses of different origin as implemented 
in Eq. 4. This separation causes different synaptic inputs to have different impact on the activity of the unit.
The functional difference can be made explicit by looking at the stable state of the winner
unit (assuming for clarity $\sigma = \varepsilon = \theta = 0$), which takes the value

$$p_{stable} = \frac{\alpha \omega \left( 1 + \kappa^{LAT} I^{LAT} + \kappa^{TD} I^{TD} \right) + \kappa^{BU} I^{BU}}{\alpha \omega \left( 1 + \kappa^{LAT} I^{LAT} + \kappa^{TD} I^{TD} \right) + \beta}, \tag{5}$$

where bottom-up input $I^{BU}$ contributes to the activity level in a linear fashion, while the contribution of
lateral and top-down inputs $I^{LAT}$ and $I^{TD}$ is non-linear, resembling the pure driving and hybrid
driving-modulating roles of afferents from different origin commonly assumed for cortical processing (Sherman
and Guillery, 1998; Friston, 2005). The course of the activity is also influenced by the excitatory
oscillatory activity $\omega$ (Fig. 2), which is given by:

$$\omega(t) = \omega_{min} + \frac{\mathrm{mod}(t, T)}{T} \left( \omega_{max} - \omega_{min} \right), \tag{6}$$

where $\omega_{min} = 0.25$ and $\omega_{max} = 0.75$ are the lower and upper bounds for the oscillation amplitude. The
excitatory oscillation does not affect the critical bifurcation point $\nu_c$, as it modulates the
self-excitability strength $\alpha$ and the lateral inhibition strength $\lambda$ to the same extent (Eq. 4). Instead, it elevates
the activity level of the units as long as they manage to resist the rising inhibition and remain in the active
state. In the state where lateral inhibition gets strong enough to shut down all but the strongest core unit,
only this winner unit is affected by the elevating impact of the excitatory oscillation, being able to further
amplify its activity at the cost of suppressing the others. The inhibitory and excitatory oscillations
presumably have different sources, the former being generated by the interneuron network of fast-spiking
(FS) inhibitory cells (Whittington et al., 1995) and the latter having its origin in activities of fast rhythmic
bursting (FRB), or chattering, excitatory neurons (Gray and McCormick, 1996).
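As a small illustration of the excitatory rhythm, the sketch below evaluates a sawtooth-shaped $\omega(t)$ that rises linearly from $\omega_{min}$ to $\omega_{max}$ within each decision cycle of length T = 25 ms. It assumes that the ramp mod(t, T) is normalized by the period T; the function name `omega` and its argument names are hypothetical.

```python
def omega(t, T=25.0, omega_min=0.25, omega_max=0.75):
    """Excitatory oscillation at time t (in ms).

    Assumption: the ramp mod(t, T) is divided by the period T, so that the
    signal stays between omega_min and omega_max and resets every cycle.
    """
    return omega_min + (t % T) / T * (omega_max - omega_min)


# One decision cycle sampled every 5 ms (illustrative sampling only).
samples = [omega(t) for t in (0.0, 5.0, 10.0, 15.0, 20.0)]
```

The signal restarts at its lower bound at the beginning of every cycle, so each cycle begins with weak excitation that grows until the next reset.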



In addition to the local competitive mechanism supported by the lateral inhibition within a column, 
we use a simple form of forward inhibition (Douglas and Martin, 1991) acting on the incoming afferents.
To model this, the incoming presynaptic activities are transformed as follows before they make up the
afferent input via the respective receptive field of a unit:

$$\hat{p}_i^{pre} = p_i^{pre} - \frac{1}{K} \sum_{j=1}^{K} p_j^{pre}, \quad pre \in \{BU, LAT, TD\},$$

$$I^{Source} = \sum_{i=1}^{K} w_i^{Source} \hat{p}_i^{pre}, \quad Source \in \{BU, LAT, TD\}, \tag{7}$$



where $p^{pre}$ stands for raw presynaptic activity, $\hat{p}^{pre}$ is the presynaptic activity transformed by forward
inhibition, K is the total number of incoming synapses of a certain origin, the weights $w_i^{Source}$ constitute
the receptive field and $I^{Source}$ designates the final computed value of the afferent input from the respective
origin. Although all plastic synaptic connections in the network are taken to be of excitatory nature, the
forward inhibition allows units to exert inhibitory action across the columns. An important effect of this
processing is the selection and amplification of strong incoming activities at the cost of weaker ones,
which can be interpreted as presynaptic competition among the afferent signals (Douglas and Martin,
1991; Swadlow, 2003).
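The effect of forward inhibition can be made concrete with a short sketch. It assumes that the transformation subtracts the mean presynaptic activity of the K afferents of one origin from each raw activity, and that the afferent input is the weighted sum of the transformed activities. The function names and the example numbers are illustrative assumptions.

```python
def forward_inhibition(p_pre):
    """Subtract the mean activity of the K afferents from each raw activity."""
    k = len(p_pre)
    mean = sum(p_pre) / k
    return [p - mean for p in p_pre]


def afferent_input(weights, p_pre):
    """Weighted sum of the transformed activities over one receptive field."""
    p_hat = forward_inhibition(p_pre)
    return sum(w * p for w, p in zip(weights, p_hat))


# A single strong afferent among silent ones (hypothetical activities).
raw = [1.0, 0.0, 0.0, 0.0]
transformed = forward_inhibition(raw)                      # strong input kept, silent ones negative
selective = afferent_input([1.0, 0.0, 0.0, 0.0], raw)      # field matched to the active afferent
unselective = afferent_input([0.5, 0.5, 0.5, 0.5], raw)    # homogeneous field
```

Under this assumption a homogeneous receptive field receives zero net input, whereas a field concentrated on the active afferent is driven strongly, which is one way to read the presynaptic competition described above.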



An additional property of the dynamics is the natural restriction of the population activity values p to
the interval between 0 and 1 (Eq. 5), given that the afferent input also stays in the same range. This allows
both interpretations of the variable as either the population rate or the probability of an arbitrary neuron 
from the population to generate a spike. 
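The stable winner activity of Eq. 5 can be explored with a few lines of Python. The sketch below is illustrative only: the coupling coefficients are set to 1 and $\omega$ to 0.5 as assumptions, since the text states only that the three coefficients are equal, and the function name `p_stable` is hypothetical.

```python
def p_stable(i_bu, i_lat, i_td, omega=0.5, alpha=1.0, beta=1.0,
             k_bu=1.0, k_lat=1.0, k_td=1.0):
    """Stable activity of the winner unit (Eq. 5) without noise, drive and threshold.

    The coupling coefficients k_* and the value of omega are assumed values.
    """
    gain = alpha * omega * (1.0 + k_lat * i_lat + k_td * i_td)
    return (gain + k_bu * i_bu) / (gain + beta)


# Bottom-up input shifts the activity linearly ...
bu_steps = [p_stable(b, 0.0, 0.0) for b in (0.0, 0.5, 1.0)]
# ... while lateral input enters through the gain and saturates.
lat_steps = [p_stable(0.0, l, 0.0) for l in (0.0, 1.0, 2.0)]
```

With these assumed coefficients the value stays between 0 and 1 whenever all inputs lie in that range, in line with the interpretation of p as a population rate or spike probability.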

2.3. Homeostatic activity regulation 

The activity dynamics equation (Eq. 4) contains the variable threshold $\theta$, which regulates the excitability
of the unit. Here, higher values of $\theta$ stand for higher unit excitability, implying a greater potential to
become active given a certain amount of input. The threshold is updated according to the following rule:

$$\frac{d\theta}{dt} = \tau_\theta \left( p_{aim} - \langle p \rangle \right), \tag{8}$$

where $\langle p \rangle = \frac{1}{T} \int_t^{t+T} p(t)\,dt$ is the average activity of the unit measured over the period T of a decision
cycle, $p_{aim}$ specifies the target activity level and $\tau_\theta$ is the inverse time constant ($\tau_\theta = 10^{-4}$ ms$^{-1}$). The
target activity level $p_{aim}$ depends on the number of units n in a column, $p_{aim} = \frac{1}{n}$. The initial value of the
excitability threshold is zero, $\theta(0) = 0$.



The motivation behind this homeostatic regulation of the unit's activity (Desai et al., 1999; Zhang and
Linden, 2003) is to encourage a uniform usage load across units in the network, so that their participation
in the formation of the memory traces is balanced. Bearing in mind the strongly competitive character of
the columnar dynamics, the regulation of the excitability threshold changes the a priori probability of a unit
to be the winner of a decision cycle. So, if a certain unit happens to take part too frequently in encoding of the
memory content, violating the requirement of uniform win probability across the units, its excitability
will be downregulated so that the core unit becomes more difficult to activate, giving an opportunity for
other units to participate in the representation. Conversely, a unit being silent for too long is upregulated,
so it can get excited more easily and contribute to memory formation.
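The homeostatic rule can be sketched as a single update applied once per decision cycle. The sketch assumes a target activity of $p_{aim} = 1/n$, an inverse time constant of $10^{-4}$ ms$^{-1}$ and an Euler step equal to the cycle length T = 25 ms; the function and argument names are hypothetical.

```python
def update_excitability(theta, mean_activity, n_units,
                        tau_theta=1e-4, T=25.0):
    """One per-cycle Euler update of the homeostatic threshold (Eq. 8).

    mean_activity is the unit's activity averaged over the last cycle.
    Assumption: the target activity is p_aim = 1 / n_units.
    """
    p_aim = 1.0 / n_units
    return theta + T * tau_theta * (p_aim - mean_activity)


# A unit that wins too often is made less excitable, a silent one more so.
overused = update_excitability(0.0, mean_activity=0.3, n_units=20)
silent = update_excitability(0.0, mean_activity=0.0, n_units=20)
on_target = update_excitability(0.0, mean_activity=0.05, n_units=20)
```

Because the update is proportional to the deviation from the target, a unit whose average activity already matches the target keeps its threshold unchanged.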

2.4. Activity-dependent bidirectional plasticity 

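We choose a bidirectional modification rule to specify how a synapse connecting one core unit to another
may undergo a change in its strength w:

$$\frac{dw}{dt} = \epsilon\, p^{pre} p^{post}\, \mathcal{H}(\chi - A(t))\, \mathcal{H}(p^{post} - \theta_0^-)\, \mathcal{H}_-^+(p^{post} - \theta_-^+), \tag{9}$$

with the sign switch functions $\mathcal{H}(x)$ and $\mathcal{H}_-^+(x)$ given as follows:

$$\mathcal{H}(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}, \qquad \mathcal{H}_-^+(x) = \begin{cases} 1, & x \geq 0 \\ -1, & x < 0 \end{cases}, \tag{10}$$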

providing the bidirectional form of the synaptic modification. The amplitude of the change is determined
by the correlation between the presynaptic activity $p^{pre}$ and the postsynaptic activity $p^{post}$, both variables
being non-negative due to the properties of the unit activity dynamics. The learning rate $\epsilon = 5 \cdot [\ldots]$ ms$^{-1}$
specifies the speed of modification, being the inverse time constant. Other variables determine the sign
of the modification. The threshold $\theta_-^+ = \max_t(p_t^{post})$ is used to compare the postsynaptic activity against
the current maximum activity in the column. A(t) is the total activity level in the postsynaptic column at
time point t, $A(t) = \sum_{i=1}^{n} p_i(t)$, where n is the number of units in the column and $p_i(t)$ their activities at
time point t. A(t) is compared to a variable gating threshold $\chi$, which follows the average total activity
level $\langle A(t) \rangle$ computed over the period T of a decision cycle:

$$\frac{d\chi}{dt} = \tau_\chi \left( \langle A(t) \rangle - \chi \right), \qquad \langle A(t) \rangle = \frac{1}{T} \int_t^{t+T} A(t)\,dt, \tag{11}$$



with $\tau_\chi = 10^{-3}$ ms$^{-1}$ as inverse time constant, the threshold initial value set to $\chi(0) = 0.5$. Furthermore,
the postsynaptic activity $p^{post}$ is compared to the sliding threshold $\theta_0^-$ that follows the average postsynaptic
activity $\langle p^{post}(t) \rangle$ computed over the period T of a decision cycle

$$\frac{d\theta_0^-}{dt} = \tau_{\theta^-} \left( \langle p^{post}(t) \rangle - \theta_0^- \right), \qquad \langle p^{post}(t) \rangle = \frac{1}{T} \int_t^{t+T} p^{post}(t)\,dt, \tag{12}$$

with the inverse time constant $\tau_{\theta^-} = 2 \cdot 10^{-3}$ ms$^{-1}$, the initial value of the threshold $\theta_0^-(0) = p_{aim}$ being
equal to the target postsynaptic activity level (see Eq. 8).

Figure 3: Bidirectional plasticity. (A) Experimentally grounded modification rule (ABS, Artola and
Singer, 1993). (B) A simplified sign switch rule used in the model.



The rule employed here is a simplified version of a bidirectional modification assuming the existence
of two sliding thresholds $\theta_0^-$ and $\theta_-^+$ (Fig. 3), which subdivide the range of postsynaptic activity into zones
where no modification, depression or potentiation may occur, resembling BCM and ABS learning rules
rooted in neurophysiological findings (Bienenstock et al., 1982; Artola and Singer, 1993; Bear, 1996; Cho
et al., 2001). If the postsynaptic activity level is too low ($p^{post} < \theta_0^-$), no modification can be triggered. A
moderate level of activation ($\theta_0^- < p^{post} < \theta_-^+$) promotes long-term depression (LTD, negative sign), and
a high level of activity ($p^{post} \geq \theta_-^+$) makes long-term potentiation (LTP, positive sign) possible. Combined
with the winner-take-all-like behavior of the core units, the intended effect of the rule is to introduce 
the competition in synaptic formation across the receptive fields of the units, enabling them to separate 
patterns even if they are highly similar and overlap strongly. If multiple core units are frequently co- 
activated by a stimulus, the winner unit gets an advantage in potentiating its stimulated synapses, while the 
stimulated synapses of the units with lower activity either do not change or are affected by the depression. 
If this situation occurs over and over, the receptive fields of previously co-activated units are supposed 
to drift apart preferring the structure where strong synapses are not in conflict with each other anymore, 
emphasizing the discriminative features of the patterns preferred by the units. 
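The three zones of the plasticity rule can be written down directly. The following sketch evaluates the sign and size of a weight change for given pre- and postsynaptic activities, assuming that both switch functions return their upper value at zero, so that the winner unit (whose activity equals the column maximum) is potentiated. The learning rate of 1.0 is a placeholder, and all function and argument names are hypothetical.

```python
def step(x):
    """Gate function: 1 for x >= 0, otherwise 0."""
    return 1.0 if x >= 0 else 0.0


def sign_switch(x):
    """Sign switch function: +1 for x >= 0, otherwise -1."""
    return 1.0 if x >= 0 else -1.0


def weight_change(p_pre, p_post, total_activity, chi,
                  theta_low, theta_high, rate=1.0):
    """Rate of change of one synapse under the bidirectional rule (Eq. 9).

    theta_low is the sliding threshold, theta_high the current maximum
    activity in the postsynaptic column; rate is a placeholder value.
    """
    return (rate * p_pre * p_post
            * step(chi - total_activity)
            * step(p_post - theta_low)
            * sign_switch(p_post - theta_high))


# Postsynaptic column with a winner at 0.5 (hypothetical numbers).
no_change = weight_change(0.8, 0.02, 0.5, 0.6, theta_low=0.05, theta_high=0.5)
depression = weight_change(0.8, 0.2, 0.5, 0.6, theta_low=0.05, theta_high=0.5)
potentiation = weight_change(0.8, 0.5, 0.5, 0.6, theta_low=0.05, theta_high=0.5)
gated_off = weight_change(0.8, 0.5, 0.9, 0.6, theta_low=0.05, theta_high=0.5)
```

In this reading only the unit at the column maximum potentiates its stimulated synapses, co-active units of intermediate activity are depressed, and no change takes place while the total column activity exceeds the gating threshold.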

In addition, we here use multiplicative synaptic scaling applied to synapses grouped according to their
origin (bottom-up, lateral and top-down). We model this simply by L2-normalization of the receptive
field vector, $\tilde{w}^{Source} = w^{Source} / \| w^{Source} \|$, with $w^{Source}$ as a weight of the receptive field comprising the
synapses of the respective origin $Source \in \{BU, LAT, TD\}$, and $\tilde{w}^{Source}$ its normalized version. The
normalizing procedure can be applied after a number of decision cycles; here we choose this number to be
10 cycles. The scaling mechanism promotes competition between synapses within the receptive field, as
the growth of one synapse happens at the cost of weakening the others (Miller and MacKay, 1994).
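A minimal sketch of the scaling step shows how normalization turns the growth of one synapse into a loss for the others. The helper name `l2_normalize` and the example weights are illustrative assumptions; in the model the step would be applied separately to the bottom-up, lateral and top-down weight groups.

```python
import math


def l2_normalize(weights):
    """Scale a receptive field vector to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in weights))
    if norm == 0.0:
        return list(weights)  # leave an all-zero field untouched
    return [w / norm for w in weights]


# Start from a homogeneous field, potentiate one synapse, then rescale.
field = l2_normalize([1.0, 1.0, 1.0, 1.0])   # four equal weights of 0.5
field[0] += 0.5                              # hypothetical potentiation
rescaled = l2_normalize(field)
```

After rescaling, the potentiated weight is larger than before while every other weight has dropped below its previous value, although none of them was depressed directly.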



2.5. Open-ended unsupervised learning and performance evaluation 

Data format. To provide the system with natural image input, we choose the AR database containing
grayscale human face photographs of 126 persons in total (Martinez and Benavente, 1998). For each
person, there is a number of views taken under different conditions (Fig. 1B). The original view with
neutral facial expression is accompanied by a duplicate view depicting the same person at a later time
point (two weeks after the original shot). Furthermore, there are variations in emotional expression such
as smiling or sad for both original and duplicate views. The images were automatically prelabeled with a
graph structure put upon the face, positioning nodes on consistent landmarks across different individuals,
using software (EAGLE) based on the algorithm described in Wiskott et al. (1997). A subset of L = 6
facial landmarks was selected around the eyes, nose and mouth regions (Fig. 1C), each landmark being
subserved by a single bunch column. Being attached to a dedicated facial landmark, each bunch column
is provided with a sensory image signal represented by a Gabor filter bank extracted locally. The Gabor
wavelet family used for the filter operation is parameterized by the frequency k and orientation $\varphi$ of the
sinusoidal wave and the width of the Gaussian envelope $\sigma$ (Daugman, 1985). We use s = 5 different
frequencies and r = 8 different orientations sampled uniformly to construct the full filter bank (for more
details refer to Wiskott et al., 1997). The local filtering of the image produces a complex vector of
responses, containing both amplitude and phase information. We use only the amplitude part consisting
of s · r = 40 real coefficients to model the responses of complex cells. This amplitude vector is further
normalized by the L2-norm to serve as bottom-up input for the respective landmark bunch column of the
lower memory layer.
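The last processing step, from complex filter responses to the bottom-up input vector, can be sketched as follows. The complex responses are assumed to be given already (the Gabor filtering itself is not shown), and the function name `bunch_input` as well as the synthetic test responses are hypothetical.

```python
import math


def bunch_input(responses):
    """Turn complex Gabor responses at one landmark into a bottom-up input.

    Only the amplitudes are kept (phase is discarded), and the amplitude
    vector is scaled to unit Euclidean length.
    """
    amplitudes = [abs(z) for z in responses]
    norm = math.sqrt(sum(a * a for a in amplitudes))
    if norm == 0.0:
        return amplitudes
    return [a / norm for a in amplitudes]


# Synthetic responses for s = 5 frequencies and r = 8 orientations.
responses = [complex(1 + (i % 5), i % 8) for i in range(5 * 8)]
features = bunch_input(responses)
```

Since only amplitudes enter the vector, a common phase shift of all responses leaves the resulting input unchanged.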

Network configurations. Selecting randomly P = 40 persons from the database, we allocate n = 20
core units for each bunch column to ensure that multiple persons have to share some common parts. The
identity column then contains m = 40 units corresponding to the number of persons we want to be able to
recall explicitly. Two different configurations of the memory system are employed to test our hypothesis 
about the functional advantage of a fully recurrent structure over the purely feed-forward one. Each 
configuration is supposed to form the memory structure in the course of the learning phase. While the 
fully recurrent configuration learns bottom-up, lateral and top-down connectivity, the purely feed-forward 
configuration is a stripped-off version using only the bottom-up pathways. Observing these different 
configurations during the learning phase and testing them on novel face views subsequently, we are able to 
compare both in terms of learning progress and performance on the recognition task to find out potential 
functional differences between them. 

Simulation. In order to run the memory network, the solutions for the differential equations governing 
the behavior of dynamical variables have to be computed numerically in an iterative fashion. We use a 
simple Euler method with a fixed time step $\Delta t = 0.02$ ms to do this. To save computational time, slow
threshold variables are updated once in a decision cycle, correcting the time steps accordingly. 
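The saving from updating slow variables once per cycle can be illustrated with the gating threshold of Eq. 11. The sketch compares one update with the cycle length T = 25 ms as time step against 1250 fine Euler steps of 0.02 ms, for a cycle-averaged total activity that is held fixed. The inverse time constant of $10^{-3}$ ms$^{-1}$ and the example values are assumptions, and both function names are hypothetical.

```python
def chi_per_cycle(chi, mean_total_activity, tau_chi=1e-3, T=25.0):
    """Single Euler update per decision cycle, using T as the time step."""
    return chi + T * tau_chi * (mean_total_activity - chi)


def chi_per_step(chi, mean_total_activity, tau_chi=1e-3, T=25.0, dt=0.02):
    """Reference: the same cycle integrated with the fine time step dt."""
    steps = round(T / dt)  # 1250 fine steps per decision cycle
    for _ in range(steps):
        chi += dt * tau_chi * (mean_total_activity - chi)
    return chi


coarse = chi_per_cycle(0.5, 1.0)
fine = chi_per_step(0.5, 1.0)
```

Because the slow variable changes by only a few percent per cycle, the single coarse update stays close to the finely integrated value in this example, at a small fraction of the cost.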

Open-ended unsupervised learning. The system starts with homogeneously initialized structure 
parameters, all threshold values and all synaptic weights being undifferentiated, so that intercolumnar all- 
to-all connectivity is the initial structure of the memory network. During the iterative learning procedure, 
for each decision cycle a face image is selected from a database randomly and presented to the system, 
evoking a pattern of activity on both memory layers and triggering synaptic and threshold modification 
mechanisms. The learning procedure is open-ended as there is neither a stop condition nor an explicitly
defined time-dependent learning rate that would decrease over time and freeze modifications
at some point. The learning progress can be assessed directly by evaluating the recognition error
on the basis of the previous network responses. Further, the inspection of the structure of the receptive 
fields gives hints about their maturation progress. Investigating the rate of ongoing modifications of the
synaptic weights and dynamic thresholds indicates whether substantial changes in the network structure
are still taking place, providing a basis for a stop condition if necessary. In the
later learning phase the general stability of the established structure can also be verified by simple visual
inspection.

Performance evaluation. To assess the recognition performance of the system, we make a distinction 
between the learning and generalization error. The learning error is defined as the rate of wrong responses
to person identity from the training data set containing the original face views with neutral expression. 
The statistics of response behavior to each particular person is gathered for each identity core unit over the 
history of the network stimulation. The learning error rate can then be computed for each small interval 
during the learning phase by using the preferences the identity units have developed for the individual 
persons during the preceding stimulation. In contrast, the generalization error is computed on the
set of novel views not presented before. During the test for generalization error, all the synaptic weights 
are frozen, which is done to exclude the possibility that recognition rate improves during the testing phase 
due to potential benefit of synaptic modifications. The generalization error is assessed for each view type 
separately to see potential performance differences between different views (the duplicate view and the 
views with two different emotional expressions, smiling and sad). The history of network behavior during 
the learning phase is used again in the same way for the computation of the error rate, as done for the 
learning error evaluation. 
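
The error computation from the response history can be sketched as follows. This is a minimal sketch under stated assumptions: a unit's preferred person is taken to be the one for which it won most often, and the function names and the small win-count matrix are hypothetical.

```python
import numpy as np

def unit_preferences(win_counts):
    """win_counts[u, k]: number of past cycles in which identity unit u won
    while person k was shown. Returns each unit's preferred person."""
    return np.argmax(np.asarray(win_counts), axis=1)

def error_rate(winner_units, shown_persons, preferences):
    """Fraction of cycles in which the winner's preferred person differs
    from the person actually shown."""
    winner_units = np.asarray(winner_units)
    shown_persons = np.asarray(shown_persons)
    return float(np.mean(preferences[winner_units] != shown_persons))

# Hypothetical history for 3 identity units and 3 persons.
win_counts = [[9, 1, 0],
              [0, 8, 2],
              [1, 0, 7]]
prefs = unit_preferences(win_counts)
# Four test cycles: the last one is answered by unit 1 although person 2 was shown.
rate = error_rate([0, 1, 2, 1], [0, 1, 2, 2], prefs)
```

For the generalization error, the same preferences would be reused while the weights stay frozen, so only `winner_units` and `shown_persons` come from the test views.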



2.6. Assessing the network's organization

To analyze the progress of structure formation, we use measures describing different properties of the 
receptive fields. The distance measure calculates the distance between two synaptic weight vectors w_i and
w_j:

$$ d(\mathbf{w}_i, \mathbf{w}_j) = \frac{1}{4} \left\| \frac{\mathbf{w}_i}{\|\mathbf{w}_i\|_2} - \frac{\mathbf{w}_j}{\|\mathbf{w}_j\|_2} \right\|_2^2 = \frac{1}{2} \left( 1 - \cos\varphi \right) \qquad (13) $$

where φ denotes the angle between the two synaptic weight vectors each comprising a receptive field. The
value lies in the interval between zero and one. If the weight vectors are the same, the distance value is
zero; if their dissimilarity is maximal (φ = π), the value is one. Utilizing this basic distance measure,
we further construct a differentiation measure, which is supposed to reflect the grade of differentiation 
between the receptive fields of the same type across the whole network. The differentiation grade D_k^Source
is computed for each column k for the receptive fields of a given type Source ∈ {BU, LAT, TD}, and then
an average differentiation value D^Source is built from the values of all K columns,
where n is the number of units in the column. The differentiation grade measure is evaluated separately
for bunch columns on the lower memory layer and for the identity column on the higher memory layer. 
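
The distance between two receptive fields can be sketched in a few lines. The sketch assumes the distance is the half-complement of the cosine similarity, d = (1 − cos φ)/2, which matches the stated range (zero for identical directions, one for φ = π); the function name `rf_distance` is hypothetical.

```python
import numpy as np

def rf_distance(w_i, w_j):
    """Angle-based distance between two weight vectors, in [0, 1]."""
    w_i = np.asarray(w_i, dtype=float)
    w_j = np.asarray(w_j, dtype=float)
    cos_phi = np.dot(w_i, w_j) / (np.linalg.norm(w_i) * np.linalg.norm(w_j))
    cos_phi = np.clip(cos_phi, -1.0, 1.0)  # guard against rounding error
    return float(0.5 * (1.0 - cos_phi))

d_same = rf_distance([1.0, 0.0], [2.0, 0.0])       # same direction
d_orth = rf_distance([1.0, 0.0], [0.0, 1.0])       # orthogonal
d_opposite = rf_distance([1.0, 0.0], [-1.0, 0.0])  # phi = pi
```

Because the measure depends only on the angle, rescaling a weight vector does not change its distance to other receptive fields.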

Further we employ a measure reflecting how sparse the inner structure of a receptive field is, that is,
whether the receptive field comprises few strong synapses and many weak synapses. If the inner receptive
field structure is poorly differentiated the sparseness value will be low; if
differentiation within the receptive field is strong, then the value will be high. To assess the same property
not only within, but also across receptive fields, the overlap measure is defined. If the receptive fields of
the same type have many strong overlapping synapses in common the value will be high; if there are only
few such overlapping synapses the value will be low. The overlap measure is thus closely related to the
differentiation grade between the receptive fields as assessed using the distance measure. Sparseness
and overlap follow the same computation scheme, with the only
difference that the former is computed within while the latter across the receptive field vectors, using a
common selectivity measure A^Source(s) as defined in (Rolls and Tovee, 1995). Again, the computation
is done for each column on receptive fields of the same type Source ∈ {BU, LAT, TD}, building then
type-specific average sparseness and overlap values over all K columns,
where r is the number of synapses comprising a receptive field of type Source ∈ {BU, LAT, TD}, n is
the number of units in a column, and K is the total number of assessed columns. The evaluation is done
separately for the bunch columns and the identity column.
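
The within/across scheme can be illustrated with the selectivity index a = (Σ v_i / n)² / (Σ v_i² / n) of Rolls and Tovee (1995). The sketch below is an assumption-laden illustration, not the paper's exact measure: the rescaling (1 − a) / (1 − 1/n), used so that high values mean sparse, and the use of a itself as an overlap indicator are choices made here, and the weight matrix is hypothetical.

```python
import numpy as np

def rolls_tovee_a(values):
    """Selectivity index: 1 for a uniform vector, 1/n for a one-hot vector.
    Undefined for an all-zero vector (division by zero)."""
    v = np.asarray(values, dtype=float)
    return float(np.mean(v) ** 2 / np.mean(v ** 2))

def normalized_sparseness(values):
    """Assumed rescaling: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(values)
    return (1.0 - rolls_tovee_a(values)) / (1.0 - 1.0 / n)

# Hypothetical column: 3 units (rows) with 4 synapses each (columns).
W = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5, 0.5]])
within = [normalized_sparseness(row) for row in W]  # one value per receptive field
across = [rolls_tovee_a(col) for col in W.T]        # one value per synapse, across units
```

Applying the same index to rows and to columns mirrors the "within" versus "across" distinction: rows measure how concentrated a single receptive field is, columns measure how many units share a strong weight on the same synapse.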

3. Results



3.1. Structure formation 

Facing a task of unsupervised learning, the system develops a structural basis for storing the faces of 
individual persons shown during the learning phase. The vocabularies for the distributed local features 
are created on the lower memory layer to represent facial parts. These vocabularies are formed by the 
bottom-up synaptic connections of the bunch columns attached to their facial landmarks. Each core unit 
of the bunch columns thus becomes sensitive to a particular local facial appearance due to the established
structure of its bottom-up receptive field. At the same time, the lateral connectivity between the bunch
columns is shaped to capture the associative relations between the distributed features. These relations
are represented by associative links between those core units that are regularly used in the composition of 
a particular individual face. The same configurational information enters into the structure of bottom-up 
connectivity converging on the identity column units, being also represented in the top-down connections 
projecting from the identity column back on the lower layer. 

Each person repeatedly presented to the system during the learning phase leaves a memory trace comprising
the parts-based representation of their face on the lower layer and the explicit configurational identity
on the higher layer of the memory (Fig. 4). The course of gradual differentiation of bottom-up, lateral
and top-down connectivity reveals the ongoing process of memory consolidation, where memory traces
induced by the face images become more stable and get the opportunity to amplify their structure. A common
developmental pattern seems to underlie the time courses of structure organization (Sec. 2.6). There is an
initial resting phase, where no structural changes appear, followed by a maturation phase, where massive
reorganization occurs and the change rate reaches its peak (Fig. 5, 6). Finally a saturation phase
is reached, where the structure stabilizes at a certain level of organization and the change rate goes down 
close to zero. 

Different connectivity types get organized preferentially within a specific time window (Fig. 5, 6).
There is a clear temporal sequence of connectivity development, starting with maturation of lower layer
bottom-up connections, followed by maturation of lateral connections between the bunch columns and
by the maturation of bottom-up connectivity of the identity column, ending with the formation of top-down
connectivity. Because the development of different connectivity types is highly interdependent,
their developmental phases are not disjoint in time, but overlap substantially. In parallel, there is a gradual
increase in sparseness within the receptive fields and progressive reduction of the overlap between them
(Fig. 6). The remaining overlap in associative lateral and configurational bottom-up connectivity reflects
the extent to which the parts are shared among different stored face representations. 

In the late learning phase, the state of the synaptic structure stabilizes until no substantial changes in
the established memory structure can be observed (Fig. 5, 6). Remarkably, the bottom-up connectivity of
the bunch columns stays well behind other connectivity types in terms of differentiation grade, sparseness
within the receptive fields and their overlap reduction achieved in the final stable state (Fig. 5, 6). Although
it is the last to begin maturing, the top-down connectivity reaches the highest grades of differentiation
and sparseness, also being most successful in reducing the overlap. The lateral connectivity between
the bunch columns and bottom-up connectivity of the identity column also show a comparably high level of
organization. These relationships reflect the distinct functional roles the different connectivity types play
in their contribution to the memory traces: lower layer bottom-up connectivity captures strongly similar
local feature appearance, whereas lateral and top-down connectivity store weakly overlapping
associative and configurational information for different faces.

Figure 4: Time snapshots of structure formation. From left to right, snapshots from early, middle and late
formation phase of (A) lower layer bottom-up connectivity containing local facial parts, (B) lower layer 
associative lateral connectivity, (C) top-down compositional connectivity projecting from the higher back 
on the lower layer, which is roughly the transposed version of the higher layer bottom-up connectivity 
visualized in (D), holding global identities. 




Figure 5: Differentiation time course over 5 · 10^5 decision cycles for different connectivity types; on the
left the grade of differentiation, on the right its rate. The general tendency toward greater connectivity
differentiation with learning progress is clearly visible, as is the temporal sequence of connectivity
maturation (see the text). BU, LAT, hBU, TD denote respectively lower layer bottom-up, lateral, higher layer
bottom-up and top-down connectivity types.




Figure 6: Overlap (A) and sparseness (B) time course over 5 · 10^5 decision cycles for different connectivity
types. As the learning progresses, the overlap between the receptive fields is continuously reduced and the
connectivity sparseness increases. Again, the temporal sequence of connectivity development is clearly
visible (see the text). BU, LAT, hBU, TD denote respectively lower layer bottom-up, lateral, higher layer 
bottom-up and top-down connectivity types. 

The changes in the synaptic structure are accompanied by the use-dependent regulation of the excitability
thresholds of the core units across the network. Three developmental phases can be distinguished in
the time course of excitability modifications (Fig. 7). The first phase is characterized by strong and rapid
excitability downregulation in the network. This downregulation brings the core units toward the
range of the targeted average activity level p_aim (Eq. 8). In this phase, almost no differences between the
individual thresholds are present (Fig. 8). After downregulation crosses its peak, a common upregulation
sets in and the differences between the excitability thresholds become much more prominent. The upregulation
phase leads to a slight increase of the average excitability and is followed by a saturation phase
where the average threshold value stabilizes around a certain level.







Figure 7: Time course of excitability regulation. Above the lower, below the higher memory layer. The
differences in excitability between the units are clearly much more pronounced on the lower layer.

Excitability regulation runs differently on different memory layers. On the lower layer the down- and 
upregulation phases are shorter and occur earlier than the corresponding phases on the higher layer. Moreover,
the differences in excitability between the units on the lower layer are much more pronounced
compared to the rather equalized excitability levels of the higher layer units (Fig. 7 and 8).

These differences reflect the distinct functional roles the lower and higher layer play in the memory 
organization. The lower layer serves as a storehouse for associatively linked distributed facial parts that 
can be shared by multiple face representations, while the identity units are conjunction-sensitive units 
representing the configurational identity of the face. Because each memorized person is equally likely to 
appear on the input, the long-term usage load of the identity units is essentially the same, so no need for 
a systematic differentiation of excitability thresholds arises there. Part sharing on the other hand imposes 
different usage frequency on different core units sensitive to different parts, leading to pronounced use- 
dependent differences in excitability between the bunch column core units. 

Figure 8: (A) Time course of average excitability regulation. Above the whole course, below the zoom
into down- and upregulation phases. On the left for the bunch units, on the right for the identity units.
Black solid curve is the average value, gray curves mark the standard deviation range. The same nomenclature
applies for the time course of the average unit activity visualized in (B). The much more pronounced
differences in excitability between the units on the lower layer are reflected in the greater dispersion
of their activities around the average activity level on the lower layer.



3.2. Activity formation and coordination 

The established synaptic structure supports the parts-based representation scheme by encoding the relations
between the parts in two alternative ways. First, the relations can be explicitly signaled by the
responses of conjunction, or configuration, specific identity core units on the higher layer, each responsible
for one of the face identities stored in the memory. Second, the relations can be represented by
dynamic assemblies of co-activated part-specific bunch core units, which can be constructed on demand
to encode a novel face or to recall an already stored one as a composition of its constituent parts. The
selection and binding of the part-specific and identity-specific units into a coherent assembly coding for
an individual face is done in the course of a decision cycle defined by common unspecific excitatory and
inhibitory signals oscillating in the gamma range (Singer, 1999; Fries et al., 2007).



There, the global decision process, which may be called binding by competition, is responsible for
assembly formation, providing clear and unambiguous temporal correlations between the selected units
and setting them apart from the rest by amplification of their response strength (Fig. 9). The initial
phase of the decision cycle, where the oscillatory inhibition and excitation are low, is characterized by 
low undifferentiated activities of the network units. With both inhibition and excitation rising, only some 
of the units are able to resist the inhibition pressure and continue increasing their activity being selected 
as candidates for assembly formation in the selection phase. Ultimately, the growing competition leads 
to a series of local winner-take-all decisions across the columns sparsening the activity in the network by 
strong amplification of a small unit subset at the cost of suppression of the others. In the late phase of a 
decision cycle, this amplified subset of winner units can then be clearly interpreted as an individual face
composed of the local features from the respective landmarks and labeled with the person's identity, solving the
assembly binding problem (von der Malsburg, 1999; Singer, 1999).

Figure 9: Activity formation during the decision cycle. (A) A sequence of six successive cycles, each
representing a successful recall of a stored individual face. On the top, the activity course is shown,
arrows pointing to constituent parts shared by two different face identities. The second and fourth cycles show
recall of the same face identity. Below is the mean activity course for each column and the oscillation 
rhythms defining the decision cycle. (B) A zoom into a single decision cycle (on the top) to visualize the 
activity formation phases. Below is the mean activity course for each column and distribution of average 
unit activities over the decision cycle showing the highly competitive nature of activity formation, where 
winner units get amplified at the cost of suppressing the others. 

A combined view on the mean activity within the columns reveals once more the competitive nature 
of activity formation in the network (Fig. 9). While the winner unit subset concentrates increasingly
high activity, the mean network activation gets progressively reduced at the end of the decision cycle after
crossing its peak in the selection phase, indicating that winner subset amplification occurs at the cost of
suppressing the rest. Generally, during the whole decision cycle the mean network activity stays at a low
level (p = 0.08 - 0.09), far below the activity level reached by the winner unit subset at the end of the
cycle (p = 0.4 - 0.6).

One may ask to what extent the competitive activity formation becomes more organized or coherent
in terms of representing the memory content as the learning progresses. In other words, we are interested
in the level of coherence, or agreement, between the local competitive decisions made in the distributed
columns and how it may change with the learning time. One indicator of such coherent behavior is the
agreement achieved at the end of the decision cycle between the afferent signals that arrive at network
units from different sources such as bottom-up, lateral or top-down. By computing the standard correlation
coefficient ρ (DeGroot and Schervish, 2001), we obtain for each afferent signal pair of different sources a
course showing the development of the coordination between the signals over the learning time.
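
The coordination measure between two afferent signal sources can be sketched with the standard (Pearson) correlation coefficient. The signal vectors below are hypothetical stand-ins for the afferent inputs that network units receive at the end of a decision cycle, and `signal_coordination` is an illustrative name.

```python
import numpy as np

def signal_coordination(sig_a, sig_b):
    """Pearson correlation between two afferent signal vectors."""
    return float(np.corrcoef(sig_a, sig_b)[0, 1])

# Hypothetical afferent signals on four units (one value per unit).
bu = np.array([0.1, 0.9, 0.2, 0.8])   # bottom-up
lat = 2.0 * bu + 0.1                  # lateral, fully consistent with bottom-up
td = 1.0 - bu                         # top-down, contradicting bottom-up

rho_bu_lat = signal_coordination(bu, lat)
rho_bu_td = signal_coordination(bu, td)
```

Evaluating such coefficients repeatedly during learning would give one time course per signal pair, with values near zero indicating uncoordinated sources and values near one indicating agreement.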

Figure 10: Improvement of signal coordination in the course of learning. Standard correlation coefficients
ρ were computed for each signal pair. BU, LAT, TD denote respectively bottom-up, lateral and top-down
signals.



The coordination level between the bottom-up, lateral and top-down signals increases gradually from
the initially very low value close to zero toward higher and higher grades (Fig. 10). The low coherence
value in the early learning phase reveals the inability of the signals converging on the network units to be
in consensus with each other about the local decision outcome, disrupting the global decision making. As
learning progresses, the signal pathway structure is gradually improved for the storage and representation 
of the content, leading to stronger and stronger consistency in local signaling. The bottom-up and lateral 
signals are the first to develop a significant grade of coherence. Slightly later the lateral and top-down
signals reach a substantial coherence level, and the last to establish a coordinated cross-talk are the
signals from bottom-up and top-down sources. Furthermore, the lateral and top-down signals establish the
strongest final grade of coherence, which is significantly higher than the coherence between bottom-up and
lateral as well as bottom-up and top-down signals. The coherence of these two pairs still reaches substantial
values though, the former being slightly above the latter.

During the course of a single decision cycle, a co-activation measure can be used to check whether 
the incoming signals are coordinated properly to make up the decisions. The relationship between the 
afferent signal coordination and the function of the memory is particularly clear if the coordination level 
in a successful recall is compared to the coordination shown during a failed recall, where the identity of
the person is misclassified (Fig. 11). In a successful recall, where the facial representation and person's
identity are correctly retrieved from the memory, a well-established coordination can be observed between
the co-active afferent signals converging on the winner units. In a failed recall, the identity column making
a wrong decision sends top-down signals that are not in agreement with the bottom-up and lateral signals
conveyed by the bunch columns. As a consequence, the signal coordination breaks down, serving as a
reliable indicator of a recall failure (Fig. 11 (D)).

A further indicator that can help in differentiating a successful from a partially or completely failed 
recall is the activity level of the winner units at the end of the decision cycle. A successful recall is 
accompanied by a high degree of cooperation between the participating winner units, so that the level of 
their final activation is high. At the same time, the competitive action of the winner unit subset strongly
suppresses the remaining activity, so that the overall network activity is substantially diminished. In contrast,
a failed recall involves disagreement between some local decisions, resulting in decreased
afferent signal coherence, which in turn leads to a much lower level of final activity in the winner units.
Their competitive influence is also weakened, leading to a higher overall network activity (Fig. 11 (F)).
Thus, a simple comparison of the winner activities to their average level can already provide enough
information to judge the quality of recall. The recall quality can be assessed on the global level
of identity as well as on the component level, where either an identity recognition failure or a part
assignment failure can be detected.
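
Such a comparison of winner activities to their average level can be sketched as a simple threshold test. The function name, the reference level and the tolerance fraction of 0.8 are illustrative assumptions and are not taken from the model.

```python
def recall_looks_successful(winner_activities, reference_level, fraction=0.8):
    """Heuristic check: winners count as coherent if their mean final activity
    reaches a given fraction of the usual (average) winner activity."""
    mean_winner = sum(winner_activities) / len(winner_activities)
    return mean_winner >= fraction * reference_level

# Hypothetical final winner activities of three columns; the reference level
# 0.5 lies in the range reported for winner units at the end of a cycle.
ok = recall_looks_successful([0.5, 0.55, 0.6], reference_level=0.5)
failed = recall_looks_successful([0.2, 0.3, 0.25], reference_level=0.5)
```

Applying the same test per column instead of to the mean would localize the failure to individual parts, in line with the component-level assessment described above.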

3.3. Recognition performance 

To assess the recognition capability of the memory, we evaluate the learning and generalization error of 
two different system configurations. These different configurations, the fully recurrent and purely feed- 
forward one, are set up to substantiate the hypothesis stating the functional advantage of the recurrent 
memory structure over the structure with purely feed-forward connectivity. Both configurations were 
trained under equal conditions and then tested to compare their performance against each other (refer to
Sec. 2.5).



Both the purely feed-forward and fully recurrent configurations are able to successfully store the facial 
identities of the persons (40 in total) in the memory structure. A strong decay of the learning error over
time is clearly evident for both network configurations. The learning error rate falls rapidly in the early
learning phase (first 5 · 10^4 decision cycles) until it saturates at values slightly below 5% in the later
phase beyond 10^5 cycles (Fig. 12). Although there is no significant difference in the learning error rate
between the two configurations after the saturation level is reached, the time needed to reach the saturation
level is substantially shorter for the fully recurrent configuration (saturates around 10^5 cycles) than for the
purely feed-forward one (saturates around 1.5 · 10^5 cycles). Thus, the fully recurrent system needs about 33%
less learning time to reach saturation than the purely feed-forward one. The fully recurrent configuration
seems to speed up the learning progress in the critical early learning phase, probably benefiting from the
additional assistance provided by lateral and top-down connectivity for the organization, amplification
and stabilization of the memory traces.

At first glance, analysis of the learning error time course suggests that the only functional advantage 
of the fully recurrent configuration is the learning speed-up observed in the early phase. However, another 
important functional advantage is revealed if the generalization error rates are compared. The generalization
error is measured on the alternative face views not shown during the learning phase (see Tab. 1). A
striking result is the significant discrepancy in performance between the two configurations on
the duplicate views containing emotional expressions (smiling and sad). There, the error rate difference is
about 5 percentage points in favor of the fully recurrent memory configuration. In relative terms, the
generalization error of the purely feed-forward configuration is about 37% larger on the duplicate smiling view
and about 57% larger on the duplicate sad view than that of the fully recurrent configuration (values
computed from Tab. 1). On the other views, no significant
difference in error rate can be detected between the two configurations.
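
The absolute and relative differences can be recomputed from the unrounded Table 1 entries as a quick check; rounding the error rates before dividing gives somewhat different percentages. The dictionary layout and variable names below are illustrative only.

```python
# Generalization error rates in percent, taken from Table 1.
recurrent = {"duplicate_smiling": 13.41, "duplicate_sad": 8.74}
feed_forward = {"duplicate_smiling": 18.42, "duplicate_sad": 13.68}

absolute_diff = {}   # difference in percentage points
relative_diff = {}   # feed-forward error relative to recurrent error, in percent
for view in recurrent:
    absolute_diff[view] = feed_forward[view] - recurrent[view]
    relative_diff[view] = 100.0 * absolute_diff[view] / recurrent[view]

for view in recurrent:
    print(f"{view}: +{absolute_diff[view]:.2f} points, +{relative_diff[view]:.1f}% relative")
```

Both views show an absolute gap of roughly five percentage points, while the relative gap is larger on the sad view because the recurrent configuration's error is smaller there.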

Figure 11: Coordination and activity formation in successful and failed recall. Two decision cycles showing
failed and successful recall. (A) Network activity course. (B) Bottom-up afferent signals course. (C)
Lateral and top-down afferent signals course. (D) Signal coordination course assessed by measuring the 
co-activation of bottom-up, lateral and top-down signals converging on the network units. In the failed 
recall, there is a clear break-down of signal coordination in afferents converging on the winner units. (E) 
Course of mean activity in the columns. In the failed recall, a substantially increased overall activation is 
clearly seen as well as the shift of its broader peak to a later time point. (F) Winner unit activities at the 
end of the decision cycle on the left and mean unit activities (excluding the winners) over whole cycle on 
the right for each column. In the failed recall, winner activities are consistently lower, while the mean rest 
unit activities are consistently higher than in the successful recall. 

Figure 12: Learning error rate of the purely feed-forward and fully recurrent memory configurations.



Configuration          Views, Error Rate
                       Original            Smiling               Sad
fully recurrent        0.1% ± 0.07%        6.06% ± 0.58%         4.02% ± 0.42%
purely feed-forward    0.067% ± 0.0528%    5.72% ± 0.92%         3.75% ± 0.38%

Configuration          Views, Error Rate
                       Duplicate           Duplicate, smiling    Duplicate, sad
fully recurrent        1.64% ± 0.16%       13.41% ± 0.94%        8.74% ± 0.38%
purely feed-forward    1.75% ± 0.13%       18.42% ± 0.93%        13.68% ± 0.64%



Table 1: Comparison of generalization error between the purely feed-forward and fully recurrent memory
configuration. The configurations were tested after a learning time of 5 · 10^5 cycles. The fully recurrent
configuration shows a significantly better performance on the duplicate views with emotional expressions, while
comparable performance is shown on the other views.

These results highlight an interesting property of the functional advantage as it has been assessed for
the fully recurrent memory configuration. The purely feed-forward configuration falls significantly behind
the fully recurrent one only on certain views, performing comparably well on the others. Apparently, the
stronger the deviation of the alternative view from the original view shown during learning, the more
evident is the enhancement in generalization capability. Even if given only the short time of a single decision
cycle, the recurrent connectivity seems to be of benefit particularly in novel situations, where purely
feed-forward processing alone has more difficulties in achieving a correct interpretation of the less familiar face
view.

4. Discussion



To identify potential neural mechanisms that are responsible for the formation of parts-based represen- 
tations in visual memory, we examined the process of experience-driven structure self-organization in a 
model of layered memory. We chose the task of unsupervised open-ended learning and recognition applied 
to human faces from a database of natural face images. The final goal was to build up a hierarchically organized
associative memory structure storing faces of individual persons in a parts-based fashion. Employing
slow activity-dependent bidirectional plasticity (Bienenstock et al., 1982; Artola and Singer, 1993; Cho
et al., 2001) together with homeostatic activity regulation (Desai et al., 1999; Zhang and Linden, 2003) and
a fast neuronal population dynamics with a strongly competitive nature, the proposed system performed
well on the posed task. It demonstrated the ability to simultaneously develop local feature
vocabularies and put them in a global context by establishing associative links between the distributed 
features on the lower memory layer. On the higher layer, the system was able to use the configurational 
information about relatedness of the sparse distributed features to memorize the face identity explicitly 
in the bottom-up connectivity of identity units. The captured feature constellations were also projected 
back to the lower layer via top-down connectivity providing additional contextual support for learning and 
recognition. The identity recognition performance of the system on the original and alternative face views 
confirmed the functionality of the established memory structure. 

Generic memory architecture. When thinking about the processes underlying the memory formation
and function, it is remarkable that the structure and activity formation in the model network can be
governed by a set of local mechanisms which are the same for all neuronal units and all synapses comprising
the network. Here, "the same" means that, for instance, the bidirectional plasticity rule
for any synapse has not only the same functional description, but also shares a common set of parameter
values such as time constant, etc. This supports the view that the synapses arriving from different origins
and contacting their target neuron at different sites of the dendritic tree and soma are a kind of universal
learning machines, which may well differ in their impact on the firing behavior of the neuron (Sherman and
Guillery, 1998; Larkum et al., 2004; Friston, 2005), while obeying the same generic modification rules.
Whether this is indeed the case is currently a subject of intense debate (Sjöström et al., 2008; Spruston,
2008). Overall, the organization of the system supports the idea of universal cortical operations involving
strong competitive and cooperative effects (von der Malsburg and Singer, 1988), which build on
essentially the same local circuitry and the same plasticity mechanisms utilized in different cortical areas
(Mountcastle, 1997; Phillips and Singer, 1997; Douglas and Martin, 2004).

Competition and cooperation in activity and structure formation. In our study, it becomes clear 
that learning itself has to rely on certain important properties of the processing on the fast as well as on 
the slow time scale. To capture statistical regularities hidden in the local sensory inputs and their global 
compositions, there have to be mechanisms for selecting and amplifying only a small fraction of available 
neuronal resources, which then become dedicated to a particular object, specializing more and more for 
the processing of its local features and their relations. Without proper selection, no learning will succeed. 
However, without proper learning, no reasonable selection can be expected either. Here, we break this 
circularity by proposing strong competitive interaction between the units on the fast activity time scale. 
Given a small amount of neural threshold noise, this interaction is able to break the symmetry of the initial
condition due to the bifurcation property of the activity dynamics (Lücke, 2005), enforcing the unit
selection and amplification in the initial learning phase even in the absence of differentiated structure. The 
response patterns enforced by competition offer sufficient playground for the learning to ignite and move 
on to organize and amplify some synaptic structure that is suitable for laying down specific memory content
via ongoing slow bidirectional Hebbian plasticity. In combination with competitive activity dynamics,
the bidirectional nature of synaptic modification further assists the competition between memory traces as
it attempts to reduce the overlap between the patterns the network units preferentially respond to, segre- 
gating memory traces in the network structure whenever possible. The state of undifferentiated structure 
is however the worst-case scenario and not necessarily the initial condition for learning, as there may be 
basis structures prepared for the representation of many behaviorally relevant patterns, such as faces
(Johnson et al., 1991). Interestingly, the progress from an undifferentiated to a highly organized state via
selection and amplification of a small subset of the available resources is a general feature in the
evolutionary and ontogenetic development of biological organisms. The notion that the very same principles
may guide the activity and structure formation in the brain supports the view of learning as an optimization
procedure adapting the nervous structure to the demands put on it by the environment (von der Malsburg
and Singer, 1988; Edelman, 1993).
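
The way threshold noise and competition can start learning from an undifferentiated state may be
illustrated with a deliberately simplified sketch. It is not the activity dynamics or the plasticity rule
of the model: the hard argmax winner selection, the update w <- w + eta * (x - w), and all names and
parameter values (learn_by_competition, eta, noise_sd, the two toy patterns) are assumptions made only
for illustration.

```python
import random


def learn_by_competition(patterns, n_units=4, epochs=20, eta=0.2,
                         noise_sd=0.01, seed=0):
    """Toy competitive learning starting from identical weights (illustrative only)."""
    rng = random.Random(seed)
    n_in = len(patterns[0])
    # Undifferentiated initial structure: every unit has the same weights.
    w = [[0.5] * n_in for _ in range(n_units)]
    winners = []
    for _ in range(epochs):
        winners = []
        for x in patterns:
            # Fast time scale: input drive plus small threshold noise.
            drive = [sum(wi * xi for wi, xi in zip(w[k], x))
                     + rng.gauss(0.0, noise_sd)
                     for k in range(n_units)]
            # Hard winner-take-all (assumed stand-in for the competition).
            k_win = max(range(n_units), key=lambda k: drive[k])
            # Slow time scale: the winner potentiates synapses from active
            # inputs and depresses synapses from inactive inputs
            # (assumed bidirectional rule, not the rule of the paper).
            w[k_win] = [wi + eta * (xi - wi) for wi, xi in zip(w[k_win], x)]
            winners.append(k_win)
    return w, winners


# Two non-overlapping toy input patterns (hypothetical data).
patterns = [[1, 1, 1, 1, 0, 0, 0, 0],
            [0, 0, 0, 0, 1, 1, 1, 1]]
weights, winners = learn_by_competition(patterns)
print("winner per pattern in the last epoch:", winners)
```

In this toy setting the noise alone decides which unit wins the first presentation. The depression of
the winner's synapses from the inactive inputs then lowers its drive for the other pattern, so the two
patterns end up with different winners and segregated weight vectors.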



Notably, there is an important difference in how unit selection, or decision making, is implemented by
competition depending on whether the connectivity structure is in its early, immature state or its late,
mature state. In the immature state, where the contextual connectivity is not yet established, the local
decisions in the lower layer bunch columns are made completely independently of each other. In contrast,
decision making in the mature state involves interactions between the local decisions via already
established lateral and top-down connections. These associative connections enable cooperation within and
competition between unit assemblies, promoting a coordinated global decision. The separation of synaptic
inputs enables decision making to use information from different origins according to its functional
significance, carrying either sensory bottom-up evidence for a local appearance or providing clues for
relational binding of distributed parts into a global configuration (Phillips and Singer, 1997). The
agreement between sensory and contextual signaling about the outcome of local decisions improves
continuously as learning progresses. The initially independent local decision making thus becomes
orchestrated by contextual support formed in the course of previous experience with visual stimuli.

Signal and plasticity coordination. The coherency of cooperative and competitive activity formation
cannot be guaranteed by the contextual support alone, as the time coordination of decision making across
distributed units also matters. The decision cycle, which defines a common reference time window for
decision making, orchestrates not only the activities, but also bidirectional synaptic modifications across
the units. This ensures that structure modification amplifies the connections within the right subset
of simultaneously highly active units encoding a particular face. The decision cycle modeled here is
reminiscent of the oscillatory rhythms in the gamma range observed in cortical processing. In particular,
there is evidence that oscillatory activity may serve as a reference signal coordinating plasticity mechanisms
in cortical neurons (Huerta and Lisman, 1995; Wespatat et al., 2004). There is also support for a phase
reset mechanism locking the oscillatory activity on the currently presented stimulus (Makeig et al., 2002;
Axmacher et al., 2006). Taken together, current evidence suggests the possible interpretation of the gamma
cycle as a rapidly repeating winner-take-all algorithm, as it is modeled in this work (Fries et al., 2007). The
winner-take-all competition can be carried out rapidly due to low latencies of fast inhibition, and its result
can be read out fast (on the scale of a few milliseconds) due to the response characteristics of the population
rate code (Gerstner, 2000).

Hierarchical parts-based representation. An essential property of our memory system is parts shar- 
ing, as it allows the same basic set of the elementary parts to be used for the combinatorial composition 
of familiar and novel objects without the need to add new physical units into the system. Endowed with
this ability, the memory network can also be interpreted as a layered neuronal bunch graph (Wiskott et al.,
1997), without taking into account the topological information. Here, the graph nodes are columns, each
holding a set of features with similar physical (visual appearance) or semantic (category or identity)
properties (Tanaka, 2003). In such a graph, new object representations can be instantiated in a
combinatorial fashion by selecting candidate features from each node. The candidate selection here depends
critically on the homeostatic regulation of activity, which ensures that each unit is able to participate in
memory formation to an equal extent. By introducing the hierarchy in the graph structure, higher order 
symbols, like identity of a person, can be explicitly represented by assigning the chosen set of candidate 
features from the lower memory layer to an identity unit on the higher layer. These higher symbols may 
be used for a compact representation of exceptionally important persons (VIPs), without discarding the in- 
formation about their composition which is kept in the top-down connections projecting back to the lower 
layer. Potentially, it would be also possible to select multiple candidates from a single node, or column, to 
represent an individual face. Here we use very strong competition leading to a form of activity sparseness 
termed hard sparseness (Rehn and Sommer, 2007), limiting the number of active units to one per column. 
While this kind of sparse coding is advantageous for learning individual faces, it may be generally too 
sparse for representing coarser categories (like male or female). However, the competition strength can
in principle be adjusted arbitrarily in a task-dependent manner, either by tuning the core unit gain or by 
balancing the self-excitation and lateral inhibition. The latter can be easily implemented by altering the 
amplitude of inhibitory or excitatory oscillations. The alteration could be initiated by some kind of inter- 
nal cortical signal or state, indicating the task-dependent need for the competition strength. The tuning of 
the competition strength would allow the formation of less sparse activity distributions, representing the 
stimulus on a coarser categorical level (Kim et al., 2008).
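
The effect of adjusting the competition strength within a column can be sketched in an abstract way. The
softmax normalization with a gain factor used below is an assumed stand-in for tuning the unit gain or the
balance of self-excitation and lateral inhibition; it is not the activity dynamics of the model, and the
function names, drive values and the activity threshold are hypothetical.

```python
import math


def column_activity(drive, gain):
    """Normalized unit activities in one column (softmax with a gain; assumption)."""
    peak = max(drive)  # subtract the peak for numerical stability
    expo = [math.exp(gain * (d - peak)) for d in drive]
    total = sum(expo)
    return [e / total for e in expo]


def n_active(activity, threshold=0.2):
    """Count the units whose activity exceeds an arbitrary threshold."""
    return sum(a > threshold for a in activity)


# Hypothetical input drive of four units in one column.
drive = [1.0, 0.9, 0.8, 0.2]

strong = column_activity(drive, gain=50.0)  # strong competition
weak = column_activity(drive, gain=2.0)     # weakened competition

print("active units, strong competition:", n_active(strong))
print("active units, weak competition:", n_active(weak))
```

With the high gain only a single unit stays above threshold, which corresponds to the hard sparseness
used in this work, whereas the low gain leaves several units active at once, as would be needed for a
less sparse, coarser code.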

Attentional and generative mechanisms in the memory. Interestingly, contextual lateral and top- 
down connectivity endows the system with further general capabilities. For instance, selective object- 
based attention is naturally given in our model, because the priming of the identity units on the higher 
memory layer by preceding sensory or direct external stimulation would also prime and facilitate the part- 
specific units on the lower layer via top-down connections, providing them with a clear advantage in the 
competition against other candidates. This priming can mediate covert attention directed to a specific 
object, promoting the pop out of its stored parts-based representation while suppressing the rest of the 
memory content. Generally, the selection and amplification by competition can be interpreted as an at- 
tentional mechanism, which focuses the neural resources on processing one object or category at the cost 
of suppressing the rest (Lee et al., 1999; Reynolds et al., 1999). Although not exploited in this study, the
network model is also able to self-generate activity patterns that correspond to the object representations
stored in the memory content in the absence of any external input. This ability relies heavily on the lateral
and top-down connectivity established by previous experience with visual stimuli, placing the model in
remarkable relation to generative approaches explaining construction of data representations in machine
learning (Ulusoy and Bishop, 2005). From this perspective, each face identity can be interpreted as a
global cause producing the specific activity patterns in the network. The identities are in turn composed 
of many local causes, i.e. their constituent parts. The memory structure captures all the relations between 
local and global causes, being able to reproduce data explicitly in an autonomous mode. 
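
The priming effect described in this paragraph can be made concrete with a minimal sketch, in which a
top-down bias from a primed identity unit is simply added to the bottom-up evidence before the winner of a
column is selected. The additive combination, the gain value and all names (column_winner, top_down_gain)
are assumptions for illustration and do not reproduce the model equations.

```python
def column_winner(bottom_up, top_down=None, top_down_gain=0.5):
    """Index of the winning unit in one column (illustrative hard selection)."""
    if top_down is None:
        top_down = [0.0] * len(bottom_up)
    # Assumed additive combination of sensory evidence and contextual support.
    total = [b + top_down_gain * t for b, t in zip(bottom_up, top_down)]
    return max(range(len(total)), key=lambda k: total[k])


# Hypothetical bottom-up evidence: unit 0 is slightly ahead of unit 1.
bottom_up = [0.55, 0.50, 0.20]
# Hypothetical top-down priming of unit 1 by an identity unit.
priming = [0.0, 1.0, 0.0]

print("winner without priming:", column_winner(bottom_up))
print("winner with priming:", column_winner(bottom_up, priming))
```

Without priming the unit with the strongest sensory evidence wins, whereas the primed unit wins once the
top-down support is added, which is the sense in which priming gives the stored parts an advantage in the
competition.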

Performance advantage over the purely feed-forward structure. Finally, we presented sound ev- 
idence for the functional advantage of lateral and top-down connectivity over the purely feed-forward 
structure in the memory formation and recall. First, the recurrent context-based connectivity seems to 
speed up the learning progress. Second, and at least as essential, the recurrent configuration significantly
outperforms the purely feed-forward configuration on the test views which deviate strongly from the original
views shown during learning. This suggests that contextual processing is able to generalize over new 
data better than the purely feed-forward solution, which performs comparably on original or only slightly 
deviating views. This outcome indicates that different processing strategies may prove more useful in
different situations. While the recurrent connectivity is mostly beneficial in novel situations, which require
additional effort for the interpretation and learning of less familiar stimulus configurations, the feed-forward
processing already suffices to do a good and quick job when facing well-known, overlearned situations, 
where effortful disambiguation is not required due to the strong familiarity of the sensory input. There, the 
feed-forward processing could benefit from the bottom-up pathway structures formed by previous expe- 
rience and evoke clear, unambiguous, easily interpretable activity patterns along the processing hierarchy 
without requiring additional contextual support from lateral and top-down connectivity. There are two pre- 
dictions arising from this outcome, which can be tested in a behavioral experiment involving subordinate 
level recognition tasks. First, deactivation of lateral and top-down connectivity in the IT would not change 
performance for overlearned content, but would impair recognition for less familiar instances of the same 
stimuli viewed under different conditions, the impairment being the more visible the stronger the viewing 
condition deviates from the overlearned one. Second, the same deactivation should lead to a measurable 
decrease in the learning speed, increasing the time needed to reach a certain low level of recognition error. 

Model predictions. Some further predictions can be derived from the system's behavior.
One general prediction is that failed memory recall should be accompanied by higher overall activation
along the IT processing hierarchy within the gamma or theta cycle, with the activity of the strongest
units at the cycle's peak being, on the contrary, diminished. Conversely, a successful recall should be
characterized by decreased overall activity in the IT and by increased activity in the cluster of winner
units. This is also interpretable in terms of signaling the degree of decision certainty, the successful recall
being accompanied by greater certainty about the recognition result. Further, a failed recall should involve
much more depression (LTD) than potentiation (LTP), and a successful recall much more LTP than LTD, on the
active synapses. In addition, if the system is required to memorize and distinguish very similar stimuli, the
recall of such an item should lead to a higher overall activity in the IT network than for items with less
similar appearance. The winner units, on the contrary, should exhibit a reduced activation due to the
inhibition originating from the competing similar content. Again, a certainty interpretation of the activity
level is possible here: the more similar the stimuli to be discriminated, the lower the winner activation
signaling the decision made, indicating lower certainty about the recognition result. An interesting
prediction concerning the bidirectional plasticity mechanism is the erasure of a memory trace after
repetitive stimulus-induced recall if the LTD/LTP transition threshold is shifted to higher values, for
example due to an artificial manipulation, as performed in experiments of selective memory erasure in mice
(Cao et al., 2008).
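
The last prediction can be illustrated with a toy calculation for a single synapse. The rule below is a
generic threshold-based (BCM-like) rule, dw = eta * pre * post * (post - theta), used here only as an
assumed stand-in for the bidirectional plasticity of the model. The function name, the linear postsynaptic
response and all parameter values are hypothetical.

```python
def recall_trace(w, theta, n_recalls=100, eta=0.1, pre=1.0, w_max=1.0):
    """Synaptic weight after repeated stimulus-induced recall (illustrative)."""
    for _ in range(n_recalls):
        # Assumed linear postsynaptic response, clipped at 1.
        post = min(w * pre, 1.0)
        # LTP if post exceeds the LTD/LTP transition threshold theta, LTD otherwise.
        dw = eta * pre * post * (post - theta)
        w = min(max(w + dw, 0.0), w_max)
    return w


w_start = 0.8  # hypothetical weight of an established memory trace

w_normal = recall_trace(w_start, theta=0.5)    # threshold below the response
w_shifted = recall_trace(w_start, theta=0.95)  # threshold shifted to higher values

print("weight after recall, normal threshold:", w_normal)
print("weight after recall, shifted threshold:", w_shifted)
```

With the threshold below the evoked response, each recall strengthens the trace up to the upper bound.
With the threshold shifted above the response, every recall induces depression, so the same repeated
recall gradually removes the trace.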

So far, we have provided a demonstration of experience-driven structure formation and its functional
benefits in a basic core of what we think can be further developed into a full-featured, hierarchically
organized visual memory domain for all kinds of natural objects. As usual, several open questions remain,
such as invariant or transformation-tolerant processing, development of a full hierarchy from elementary
visual features to object categories and identities, establishing the interface for behaviorally relevant
context as proposed in the framework of reinforcement learning, incorporating the mechanisms of active
vision and so on. Nevertheless, with this work we hope to have succeeded not only in highlighting the crucial
importance of coherent interplay between the bottom-up and top-down influences in the process of memory
formation and recognition, but also in gaining more insight into the basic principles behind the
self-organization (von der Malsburg, 2003) of a successful subsystem coordination across different time
scales. Aiming for real world applications, we believe that the incremental, unsupervised open-ended learning
design instantiated in this work provides an inspiring and guiding paradigm for developing systems capable of
discovering and storing complex structural regularities from natural sensory streams over multiple levels of
description.